Exploiting pitch dynamics for speech spectral estimation using a two-dimensional processing framework
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (p. 133-135). This thesis addresses the problem of obtaining an accurate spectral representation of speech formant structure when the voicing source exhibits a high fundamental frequency. Our work is inspired by auditory perception and physiological modeling studies implicating the use of temporal changes in speech by humans. Specifically, we develop and evaluate signal processing schemes that exploit temporal change of pitch as a basis for high-pitch formant estimation. As part of our development, we assess the source-filter separation capabilities of several two-dimensional processing schemes that utilize both standard spectrographic and auditory-based time-frequency representations. Our methods show quantitative improvements under certain conditions over representations derived from traditional and homomorphic linear prediction. We conclude by highlighting potential benefits of our framework in the particular application of speaker recognition, with preliminary results indicating a performance gender-gap closure on subsets of the TIMIT corpus. by Tianyu Tom Wang. S.M.
Toward an interpretive framework of two-dimensional speech-signal processing
Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 177-179). Traditional representations of speech are derived from short-time segments of the signal and result in time-frequency distributions of energy such as the short-time Fourier transform and spectrogram. Speech-signal models of such representations have had utility in a variety of applications such as speech analysis, recognition, and synthesis. Nonetheless, they do not capture spectral, temporal, and joint spectrotemporal energy fluctuations (or "modulations") present in local time-frequency regions of the time-frequency distribution. Inspired by principles from image processing and evidence from auditory neurophysiological models, a variety of two-dimensional (2-D) processing techniques have been explored in the literature as alternative representations of speech; however, speech-based models are lacking in this framework. This thesis develops speech-signal models for a particular 2-D processing approach in which 2-D Fourier transforms are computed on local time-frequency regions of the canonical narrowband or wideband spectrogram; we refer to the resulting transformed space as the Grating Compression Transform (GCT). We argue for a 2-D sinusoidal-series amplitude modulation model of speech content in the spectrogram domain that relates to speech production characteristics such as pitch/noise of the source, pitch dynamics, formant structure and dynamics, and offset/onset content. Narrowband- and wideband-based models are shown to exhibit important distinctions in interpretation and oftentimes "dual" behavior. In the transformed GCT space, the modeling results in a novel taxonomy of signal behavior based on the distribution of formant and onset/offset content in the transformed space via source characteristics.
Our formulation provides a speech-specific interpretation of the concept of "modulation" in 2-D processing, in contrast to existing approaches that have done so either phenomenologically through qualitative analyses and/or implicitly through data-driven machine learning approaches. One implication of the proposed taxonomy is its potential for interpreting transformations of other time-frequency distributions such as the auditory spectrogram, which is generally viewed as being "narrowband"/"wideband" in its low/high-frequency regions. The proposed signal model is evaluated in several ways. First, we perform analysis of synthetic speech signals to characterize its properties and limitations. Next, we develop an algorithm for analysis/synthesis of spectrograms using the model and demonstrate its ability to accurately represent real speech content. As an example application, we further apply the models in cochannel speaker separation, exploiting the GCT's ability to distribute speaker-specific content and often recover overlapping information through demodulation and interpolation in the 2-D GCT space. Specifically, in multi-pitch estimation, we demonstrate the GCT's ability to accurately estimate separate and crossing pitch tracks under certain conditions. Finally, we demonstrate the model's ability to separate mixtures of speech signals using both prior and estimated pitch information. Generalization to other speech-signal processing applications is proposed. by Tianyu Tom Wang. Ph.D.
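The core operation named in the abstract above, a 2-D Fourier transform of a local time-frequency region, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `local_gct`, the 64x64 patch size, the Hann window, the mean removal, and the synthetic grating that stands in for harmonic lines in a narrowband spectrogram are all assumptions made for illustration.

```python
import numpy as np

def local_gct(patch):
    """Magnitude of the 2-D Fourier transform of one spectrogram patch.

    Hypothetical helper: rows are frequency bins, columns are time frames.
    The patch mean is removed and a 2-D Hann window is applied before the
    transform; both choices are assumptions, not taken from the thesis.
    """
    window = np.outer(np.hanning(patch.shape[0]), np.hanning(patch.shape[1]))
    centered = patch - patch.mean()
    # fftshift moves the zero-frequency term to the centre of the array.
    return np.abs(np.fft.fftshift(np.fft.fft2(centered * window)))

# Synthetic stand-in for a local region of a narrowband spectrogram:
# a 2-D sinusoidal grating with 8 cycles along frequency and 2 along time.
n_freq, n_time = 64, 64
f = np.arange(n_freq)[:, None]
t = np.arange(n_time)[None, :]
patch = 1.0 + np.cos(2 * np.pi * (8 * f / n_freq + 2 * t / n_time))

mag = local_gct(patch)
row, col = np.unravel_index(np.argmax(mag), mag.shape)
# Offset of the strongest component from the origin of the transformed space.
# The magnitude is symmetric about the origin, so absolute values are used.
peak_offset = (abs(int(row) - n_freq // 2), abs(int(col) - n_time // 2))
print(peak_offset)
```

In this toy case the grating's energy concentrates at a single pair of points in the transformed space, at an offset matching the grating's cycle counts along each axis. It only illustrates the idea of harmonically related content mapping to a concentrated location; it does not model formants, onsets, or real speech.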
Towards co-channel speaker separation by 2-D demodulation of spectrograms
This paper explores a two-dimensional (2-D) processing approach for co-channel speaker separation of voiced speech. We analyze localized time-frequency regions of a narrowband spectrogram using 2-D Fourier transforms and propose a 2-D amplitude modulation model based on pitch information for single and multi-speaker content in each region. Our model maps harmonically-related speech content to concentrated entities in a transformed 2-D space, thereby motivating 2-D demodulation of the spectrogram for analysis/synthesis and speaker separation. Using a priori pitch estimates of individual speakers, we show through a quantitative evaluation: 1) the utility of the model for representing speech content of a single speaker and 2) its feasibility for speaker separation. For the separation task, we also illustrate benefits of the model's representation of pitch dynamics relative to a sinusoidal-based separation system. United States. Dept. of Defense. Air Force (Contract FA8721-05-C-0002)
High-Pitch Formant Estimation by Exploiting Temporal Change of Pitch
This paper considers the problem of obtaining an accurate spectral representation of speech formant structure when the voicing source exhibits a high fundamental frequency. Our work is inspired by auditory perception and physiological studies implicating the use of pitch dynamics in speech by humans. We develop and assess signal processing schemes aimed at exploiting temporal change of pitch to address the high-pitch formant frequency estimation problem. Specifically, we propose a 2-D analysis framework using 2-D transformations of the time-frequency space. In one approach, we project changing spectral harmonics over time to a 1-D function of frequency. In a second approach, we draw upon previous work of Quatieri and Ezzat, with similarities to the auditory modeling efforts of Chi, where localized 2-D Fourier transforms of the time-frequency space provide improved source-filter separation when pitch is changing. Our methods show quantitative improvements for synthesized vowels with stationary formant structure in comparison to traditional and homomorphic linear prediction. We also demonstrate the feasibility of applying our methods on stationary vowel regions of natural speech spoken by high-pitch females of the TIMIT corpus. Finally, we show improvements afforded by the proposed analysis framework in formant tracking on examples of stationary and time-varying formant structure. United States. Dept. of Defense (Air Force Contract FA8721 05 C 0002)
Recommended from our members
Delayed Low-Intensity Extracorporeal Shock Wave Therapy Ameliorates Impaired Penile Hemodynamics in Rats Subjected to Pelvic Neurovascular Injury
Background: Erectile dysfunction (ED) caused by pelvic neurovascular injury (PNVI) is often refractory to treatment. In many cases, erectogenic therapy is administered in a delayed fashion. Aim: To evaluate penile hemodynamic effects and histologic changes associated with delayed low-intensity extracorporeal shock wave therapy (Li-ESWT) after PNVI ED in a rat model. We visualized images using immunofluorescence and 3-dimensional imaging of solvent-cleared organs (3DISCO), a novel imaging technique. Methods: A total of 32 Sprague-Dawley male rats aged 12 weeks were divided equally into 4 groups: sham surgery as normal controls (NC), PNVI controls (PC), PNVI with very-low-energy Li-ESWT (PVL), and PNVI with low-energy Li-ESWT (PL). Bilateral cavernous nerve crush and internal pudendal bundle ligation were performed in the 3 PNVI groups. Li-ESWT was administered twice a week for 4 weeks in the PL and PVL groups starting at 4 weeks after PNVI. Outcomes: Intracavernous pressure (ICP) studies (normalized to mean arterial pressure [MAP]) were conducted in all subject animals. After testing, tissue was harvested for immunofluorescence staining and 3DISCO analysis. Results: Mean ICP/MAP was lower in PC animals compared with NC animals (0.37 ± 0.03 vs 0.91 ± 0.03, respectively; P = .001). The ICP/MAP ratio was significantly higher in PVL and PL animals (0.66 ± 0.07 and 0.82 ± 0.05, respectively) compared with PC animals (P = .002 and .001, respectively). Detailed microstructures and trajectories of nerves and vessels were identified with immunofluorescence and 3DISCO. The PC group had lower density of nerves, axons, neuronal nitric oxide synthase-positive nerves, and Schwann cells in the dorsal penis.
Animals in the PL group had significantly higher expression of all of these markers compared with PC animals. Clinical implications: Li-ESWT may have utility in the management of severe ED related to PNVI from severe pelvic injury or radical pelvic surgeries, even when administered in a delayed fashion. Strength & limitations: This study of a severe ED phenotype involved treatment administered in a delayed fashion, which is more consistent with how therapy likely would be delivered in a real-world clinical context. Moreover, because the treatment commenced at 4 weeks after injury, when nerve and tissue atrophy have already occurred, the results imply that Li-ESWT can be used for regenerative therapy. Additional studies on dose optimization and treatment interval are needed to inform the design of human clinical trials. Conclusion: Li-ESWT ameliorates the negative functional and histologic effects of severe pelvic neurovascular injury in a rat model system. 3DISCO provides high-resolution images of neuroanatomy and neural regeneration. Wang HS, Ruan Y, Banie L, et al. Delayed Low-Intensity Extracorporeal Shock Wave Therapy Ameliorates Impaired Penile Hemodynamics in Rats Subjected to Pelvic Neurovascular Injury. J Sex Med 2019;16:17-26.